NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

https://doi.org/10.1109/TCC.2024.3476210

Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K K (October 2024, IEEE Transactions on Cloud Computing)

Full Text Available
SmartWatch: accurate traffic analysis and flow-state tracking for intrusion prevention using SmartNICs

https://doi.org/10.1145/3485983.3494861

Panda, Sourav; Feng, Yixiao; Kulkarni, Sameer G; Ramakrishnan, K. K.; Duffield, Nick; Bhuyan, Laxmi N. (December 2021, CoNEXT '21: Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies)

Despite advances in network security, attacks targeting mission critical systems and applications remain a significant problem for network and datacenter providers. Existing telemetry platforms detect volumetric attacks at terabit scales using approximation techniques and coarse grain analysis. However, the prevalence of low and slow attacks that require very little bandwidth, makes flow-state tracking critical to overall attack mitigation. Traffic queries deployed on network switches are often limited by hardware constraints, preventing them from carrying out flow tracking features required to detect stealthy attacks. Such attacks can go undetected in the midst of high traffic volumes. We design SmartWatch, a novel flow state tracking and flow logging system at line rate, using SmartNICs to optimize performance and simultaneously detect a number of stealthy attacks. SmartWatch leverages advances in switch based network telemetry platforms to process the bulk of the traffic and only forward suspicious traffic subsets to the SmartNIC. The programmable network switches perform coarse-grained traffic analysis while the SmartNIC conducts the finer-grained analysis which involves additional processing of the packet as a 'bump-in-the-wire'. A control loop between the SmartNIC and programmable switch tunes the queries performed in the switch to direct the most appropriate traffic subset to the SmartNIC. SmartWatch's cooperative monitoring approach yields 2.39 times better detection rate compared to existing platforms deployed on programmable switches. SmartWatch can detect covert timing channels and perform website fingerprinting more efficiently compared to standalone programmable switch solutions, relieving switch memory and control-plane processor resources. Compared to host-based approaches, SmartWatch can reduce the packet processing latency by 72.32%.
more » « less
Full Text Available
Assessing Container Network Interface Plugins: Functionality, Performance, and Scalability

https://doi.org/10.1109/TNSM.2020.3047545

Qi, Shixiong; Kulkarni, Sameer G.; Ramakrishnan, K. K. (March 2021, IEEE Transactions on Network and Service Management)
null (Ed.)
Full Text Available
ECML: Improving Efficiency of Machine Learning in Edge Clouds

https://doi.org/10.1109/CloudNet51028.2020.9335804

Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K. K. (November 2020, 2020 IEEE 9th International Conference on Cloud Networking (CloudNet))
null (Ed.)
Edge cloud data centers (Edge) are deployed to provide responsive services to the end-users. Edge can host more powerful CPUs and DNN accelerators such as GPUs and may be used for offloading tasks from end-user devices that require more significant compute capabilities. But Edge resources may also be limited and must be shared across multiple applications that process requests concurrently from several clients. However, multiplexing GPUs across applications is challenging. With edge cloud servers needing to process a lot of streaming and the advent of multi-GPU systems, getting that data from the network to the GPU can be a bottleneck, limiting the amount of work the GPU cluster can do. The lack of prompt notification of job completion from the GPU can also result in poor GPU utilization. We build on our recent work on controlled spatial sharing of a single GPU to expand to support multi-GPU systems and propose a framework that addresses these challenges. Unlike the state-of-the-art uncontrolled spatial sharing currently available with systems such as CUDA-MPS, our controlled spatial sharing approach uses each of the GPU in the cluster efficiently by removing interference between applications, resulting in much better, predictable, inference latency We also use each of the cluster GPU's DMA engines to offload data transfers to the GPU complex, thereby preventing the CPU from being the bottleneck. Finally, our framework uses the CUDA event library to give timely, low overhead GPU notifications. Our evaluations show we can achieve low DNN inference latency and improve DNN inference throughput by at least a factor of 2.
more » « less
Full Text Available
Mu: An Efficient, Fair and Responsive Serverless Framework for Resource-Constrained Edge Clouds

https://doi.org/10.1145/3472883.3487014

Mittal, Viyom; Qi, Shixiong; Bhattacharya, Ratnadeep; Lyu, Xiaosu; Li, Junfeng; Kulkarni, Sameer G.; Li, Dan; Hwang, Jinho; Ramakrishnan, K. K.; Wood, Timothy (November 2021, Proceedings of the ACM Symposium on Cloud Computing)

Serverless computing platforms simplify development, deployment, and automated management of modular software functions. However, existing serverless platforms typically assume an over-provisioned cloud, making them a poor fit for Edge Computing environments where resources are scarce. In this paper we propose a redesigned serverless platform that comprehensively tackles the key challenges for serverless functions in a resource constrained Edge Cloud. Our Mu platform cleanly integrates the core resource management components of a serverless platform: autoscaling, load balancing, and placement. Each worker node in Mu transparently propagates metrics such as service rate and queue length in response headers, feeding this information to the load balancing system so that it can better route requests, and to our autoscaler to anticipate workload fluctuations and proactively meet SLOs. Data from the Autoscaler is then used by the placement engine to account for heterogeneity and fairness across competing functions, ensuring overall resource efficiency, and minimizing resource fragmentation. We implement our design as a set of extensions to the Knative serverless platform and demonstrate its improvements in terms of resource efficiency, fairness, and response time. Evaluating Mu, shows that it improves fairness by more than 2x over the default Kubernetes placement engine, improves 99th percentile response times by 62% through better load balancing, reduces SLO violations and resource consumption by pro-active and precise autoscaling. Mu reduces the average number of pods required by more than ~15% for a set of real Azure workloads.
more » « less
Full Text Available
Machine Learning at the Edge: Efficient Utilization of Limited CPU/GPU Resources by Multiplexing

https://doi.org/10.1109/ICNP49622.2020.9259361

Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K. K. (October 2020, Proc. of Riding with AI towards Mission-Critical Communications and Computing at the Edge (AIMCOM2) Workshop in IEEE ICNP 2020)
null (Ed.)
Edge clouds can provide very responsive services for end-user devices that require more significant compute capabilities than they have. But edge cloud resources such as CPUs and accelerators such as GPUs are limited and must be shared across multiple concurrently running clients. However, multiplexing GPUs across applications is challenging. Further, edge servers are likely to require considerable amounts of streaming data to be processed. Getting that data from the network stream to the GPU can be a bottleneck, limiting the amount of work GPUs do. Finally, the lack of prompt notification of job completion from GPU also results in ineffective GPU utilization. We propose a framework that addresses these challenges in the following manner. We utilize spatial sharing of GPUs to multiplex the GPU more efficiently. While spatial sharing of GPU can increase GPU utilization, the uncontrolled spatial sharing currently available with state-of-the-art systems such as CUDA-MPS can cause interference between applications, resulting in unpredictable latency. Our framework utilizes controlled spatial sharing of GPU, which limits the interference across applications. Our framework uses the GPU DMA engine to offload data transfer to GPU, therefore preventing CPU from being bottleneck while transferring data from the network to GPU. Our framework uses the CUDA event library to have timely, low overhead GPU notifications. Preliminary experiments show that we can achieve low DNN inference latency and improve DNN inference throughput by a factor of ∼1.4.
more » « less
Full Text Available
GSLICE: controlled spatial sharing of GPUs for a scalable inference platform

https://doi.org/10.1145/3419111.3421284

Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K. K. (October 2020, SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing)
null (Ed.)
The increasing demand for cloud-based inference services requires the use of Graphics Processing Unit (GPU). It is highly desirable to utilize GPU efficiently by multiplexing different inference tasks on the GPU. Batched processing, CUDA streams and Multi-process-service (MPS) help. However, we find that these are not adequate for achieving scalability by efficiently utilizing GPUs, and do not guarantee predictable performance. GSLICE addresses these challenges by incorporating a dynamic GPU resource allocation and management framework to maximize performance and resource utilization. We virtualize the GPU by apportioning the GPU resources across different Inference Functions (IFs), thus providing isolation and guaranteeing performance. We develop self-learning and adaptive GPU resource allocation and batching schemes that account for network traffic characteristics, while also keeping inference latencies below service level objectives. GSLICE adapts quickly to the streaming data's workload intensity and the variability of GPU processing costs. GSLICE provides scalability of the GPU for IF processing through efficient and controlled spatial multiplexing, coupled with a GPU resource re-allocation scheme with near-zero (< 100μs) downtime. Compared to default MPS and TensorRT, GSLICE improves GPU utilization efficiency by 60--800% and achieves 2--13X improvement in aggregate throughput.
more » « less
Full Text Available
Understanding Container Network Interface Plugins: Design Considerations and Performance

https://doi.org/10.1109/LANMAN49260.2020.9153266

Qi, Shixiong; Kulkarni, Sameer G; Ramakrishnan, K. K. (July 2020, 26th {IEEE} International Symposium on Local and Metropolitan Area Networks, {LANMAN} 2020, Orlando, FL, USA, July 13-15, 2020)

Kubernetes, an open-source container orchestration platform, has been widely adopted by cloud service providers (CSPs) for its advantages in simplifying container deployment, scalability and scheduling. Networking is one of the central components of Kubernetes, providing connectivity between different pods (group of containers) both within the same host and across hosts. To bootstrap Kubernetes networking, the Container Network Interface (CNI) provides a unified interface for the interaction between container runtimes. There are several CNI implementations, available as open-source ‘CNI plugins’. While they differ in functionality and performance, it is a challenge for a cloud provider to differentiate and choose the appropriate plugin for their environment. In this paper, we compare the various open source CNI plugins available from the community, qualitatively and through detailed quantitative measurements. With our experimental evaluation, we analyze the overheads and bottlenecks for each CNI plugin, as a result of the network model it implements, interaction with the host network protocol stack and the network policies implemented in iptables rules. The choice of the CNI plugin may also be based on whether intra-host or inter-host communication dominates.
more » « less
Full Text Available
Managing State for Failure Resiliency in Network Function Virtualization

https://doi.org/10.1109/LANMAN49260.2020.9153271

Kulkarni, Sameer G; Ramakrishnan, K. K.; Wood, Timothy (July 2020, 2020 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN)

Ensuring high scalability (elastic scale-out and consolidation), as well as high availability (failure resiliency) are critical in encouraging adoption of software-based network functions (NFs). In recent years, two paradigms have evolved in terms of the way the NFs manage their state - namely the Stateful (state is coupled with the NF instance) and a Stateless (state is externalized to a datastore) manner. These two paradigms present unique challenges and opportunities for ensuring high scalability and high availability of NFs and NF chains. In this work, we assess the impact on ensuring the correctness of NF state including the implications of non-determinism in packet processing, and carefully analyze and present the benefits and disadvantages of the two state management paradigms. We leverage OpenNetVM and Redis in-memory datastore to implement both state management paradigms and empirically compare the two. Although the stateless paradigm is desirable for elastic scaling, our experimental results show that, even at line-rate packet processing (10 Gbps), stateful NFs can achieve chain-level failover across servers in a LAN incurring less than 10% performance. The state-of-the-art stateless counterparts incur severe throughput penalties. We observe 30-85% overhead on normal processing, depending on the mode of state updated to the externalized datastore.
more » « less
Full Text Available
Understanding Open Source Serverless Platforms: Design Considerations and Performance

https://doi.org/10.1145/3366623.3368139

Li, Junfeng; Kulkarni, Sameer G.; Ramakrishnan, K. K.; Li, Dan (December 2019, Proceedings of the 5th International Workshop on Serverless Computing, WOSC@Middleware 2019, Davis, CA, USA)

Serverless computing is increasingly popular because of the promise of lower cost and the convenience it provides to users who do not need to focus on server management. This has resulted in the availability of a number of proprietary and open-source serverless solutions. We seek to understand how the performance of serverless computing depends on a number of design issues using several popular open-source serverless platforms. We identify the idiosyncrasies affecting performance (throughput and latency) for different open-source serverless platforms. Further, we observe that just having either resource-based (CPU and memory) or workload-based (request per second (RPS) or concurrent requests) auto-scaling is inadequate to address the needs of the serverless platforms.
more » « less
Full Text Available

« Prev Next »

Search for: All records